feat: Kornia GPU augmentation backend for detection training #874
Conversation
- Add `augmentation_backend` field to `TrainConfig` (cpu/auto/gpu); cpu is the default
- New `src/rfdetr/datasets/kornia_transforms.py`: registry of 8 transform factories, `build_kornia_pipeline`, `build_normalize`, `collate_boxes`/`unpack_boxes` box utilities
- Wire `gpu_postprocess` flag through `coco.py` and `yolo.py` so CPU Albumentations augmentation and normalize are skipped when the GPU path is active
- Add `_setup_kornia_pipeline` + `on_after_batch_transfer` to `RFDETRDataModule`; segmentation models skip GPU aug (Phase 2) with a one-time warning
- Add `kornia>=0.7,<1` optional dep group in `pyproject.toml`
- 12 new tests across `test_module_data.py` and `test_kornia_transforms.py`

---
Co-authored-by: Claude Code <noreply@anthropic.com>
Pull request overview
Adds an opt-in GPU-side augmentation path for detection training by introducing an augmentation_backend switch and routing normalization/augmentation to run after the batch is transferred to device (via RFDETRDataModule.on_after_batch_transfer), while keeping the existing CPU Albumentations pipeline as the default.
Changes:
- Add `TrainConfig.augmentation_backend` (`"cpu" | "auto" | "gpu"`) and a new optional dependency group `kornia`.
- Thread a `gpu_postprocess` flag through COCO/YOLO dataset builders so CPU Albumentations + Normalize can be skipped when GPU postprocessing is active.
- Add DataModule logic + tests for backend resolution and the `on_after_batch_transfer` hook.
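For orientation, a minimal sketch of how the new switch is meant to be used. The field name and allowed values are from the PR; the config class below is a simplified stand-in, not the real `TrainConfig`:

```python
from dataclasses import dataclass

VALID_BACKENDS = ("cpu", "auto", "gpu")

@dataclass
class TrainConfigSketch:
    # "cpu" keeps the existing Albumentations pipeline (the default);
    # "gpu" runs Kornia augmentation after batch transfer;
    # "auto" picks "gpu" only when CUDA and kornia are both available.
    augmentation_backend: str = "cpu"

    def __post_init__(self) -> None:
        if self.augmentation_backend not in VALID_BACKENDS:
            raise ValueError(f"augmentation_backend must be one of {VALID_BACKENDS}")

cfg = TrainConfigSketch(augmentation_backend="auto")
```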
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `src/rfdetr/training/module_data.py` | Adds Kornia pipeline setup and the `on_after_batch_transfer` GPU postprocessing hook. |
| `src/rfdetr/datasets/coco.py` | Adds `gpu_postprocess` option to training transforms and wires the backend flag through dataset builders. |
| `src/rfdetr/datasets/yolo.py` | Wires the backend flag through the Roboflow-from-YOLO builder. |
| `src/rfdetr/config.py` | Introduces `augmentation_backend` on `TrainConfig`. |
| `src/rfdetr/datasets/aug_config.py` | Documents the Kornia GPU backend and Phase 1 limitations. |
| `pyproject.toml` | Adds optional dependency group `kornia`. |
| `tests/training/test_module_data.py` | Adds tests for backend resolution and `on_after_batch_transfer`. |
| `tests/training/conftest.py` | Adds autouse fixture to restore the `RFDETRDataModule.trainer` property after tests. |
| `CHANGELOG.md` | Documents the new `augmentation_backend` feature. |
Comments suppressed due to low confidence (1)
`src/rfdetr/datasets/coco.py:356`

The `make_coco_transforms()` docstring and Args list no longer match behavior now that `gpu_postprocess` can skip Albumentations and `Normalize()` for the train split. Please document the new `gpu_postprocess` parameter and clarify that normalization is deferred to the DataModule GPU path when it is enabled.
"""Build the standard COCO transform pipeline for a given dataset split.
Returns a composed transform that resizes images to the target ``resolution``
(with optional multi-scale jitter), applies Albumentations-based augmentations
during training, and normalises pixel values with ImageNet statistics.
For the ``"train"`` split the pipeline uses a two-branch ``OneOf`` between a
direct resize and a resize → random-crop → resize sequence (built via
:func:`_build_train_resize_config`), followed by the augmentation stack and
normalisation. For ``"val"``, ``"test"``, and ``"val_speed"`` only resize and
normalisation are applied — no augmentation.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…rmalize

- Import get_logger and add module-level logger to o365.py
- Detect augmentation_backend from args; emit WARNING when non-cpu (Phase 1 limitation: no aug_config support for O365)
- Compute gpu_postprocess flag and pass to both make_coco_transforms / make_coco_transforms_square_div_64 calls

Addresses review comment — HIGH blocking: double normalize for O365 users with augmentation_backend != 'cpu' (PR #874)

---
Co-authored-by: Claude Code <noreply@anthropic.com>
…RNING
- Add _kornia_setup_done: bool = False in __init__ to prevent _setup_kornia_pipeline re-running on every setup('fit') call when the auto+no-CUDA/no-kornia fallback leaves _kornia_pipeline as None
- Switch the auto+no-CUDA fallback from logger.info to logger.warning (consistent with auto+no-kornia WARNING)
Addresses review comments — MEDIUM: setup guard re-runs in fallback path; inconsistent log levels (PR #874)
---
Co-authored-by: Claude Code <noreply@anthropic.com>
- _make_gaussian_blur: enforce blur_limit >= 3 after odd-rounding (Kornia requires kernel_size >= 3)
- make_coco_transforms, make_coco_transforms_square_div_64: add gpu_postprocess to Args docstring
- unpack_boxes: correct docstring claiming in-place mutation (function returns shallow copies)
- conftest.py: fix docstring wording LightningModule → LightningDataModule

Addresses review comments from @Copilot and @review on PR #874

---
Co-authored-by: Claude Code <noreply@anthropic.com>
…pipeline forward pass
- TestGaussianBlurMinKernel: parametrized test for blur_limit=1,2 producing valid kernel_size >= 3
- TestKorniaPipelineForwardPass: shape/dtype check and empty-bbox batch through built pipeline (kornia skip guard)
- TestBuildO365RawGpuBackend: warning emitted for non-cpu backend; gpu_postprocess wired correctly; square-resize delegate
- TestKorniaSetupDoneSentinel: sentinel starts False, set after fit, _setup_kornia_pipeline called exactly once across repeated setup('fit') calls
Closes review test-coverage gaps from PR #874
---
Co-authored-by: Claude Code <noreply@anthropic.com>
When augmentation_backend != 'cpu' and aug_config is not explicitly set,
build_kornia_pipeline was receiving {} (empty dict) while the CPU path
correctly fell back to AUG_CONFIG. This caused GPU training to have zero
augmentation by default — a silent regression.
- Import AUG_CONFIG in module_data.py
- Use `train_config.aug_config if ... is not None else AUG_CONFIG` in _setup_kornia_pipeline
- Add test_gpu_path_uses_aug_config_fallback to TestBackendResolution
Addresses QA finding B1 (blocking) from post-commit review of PR #874
---
Co-authored-by: Claude Code <noreply@anthropic.com>
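The fix described above boils down to an explicit `is not None` check. A standalone sketch, with placeholder defaults standing in for rfdetr's real `AUG_CONFIG` dict:

```python
# Placeholder defaults; the real AUG_CONFIG lives in rfdetr's aug_config module
AUG_CONFIG = {"horizontal_flip": 0.5, "color_jitter": 0.4}

def resolve_aug_config(user_aug_config):
    # `is not None`, not truthiness: an explicit empty dict (disable all
    # augmentation) must be honoured, while an unset None falls back to defaults
    return user_aug_config if user_aug_config is not None else AUG_CONFIG
```

Passing `{}` straight through to `build_kornia_pipeline` is exactly the silent-regression path the fix closes.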
Codecov Report

❌ Your patch check has failed because the patch coverage (73%) is below the target coverage (95%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@ Coverage Diff @@
##           develop   #874   +/- ##
=======================================
- Coverage       79%    79%    -0%
=======================================
  Files           97     98     +1
  Lines         7819   8044   +225
=======================================
+ Hits          6169   6340   +171
- Misses        1650   1704    +54
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…auto backend
- _make_gaussian_blur: kernel_size=(blur_limit, blur_limit) instead of (3, blur_limit) — square kernel per Albumentations semantics (Copilot #2991808605)
- build_normalize: pass plain Python tuples instead of torch.tensor() so Kornia handles device placement (Copilot #2991808573)
- on_after_batch_transfer: call .to(img.device) on pipeline and normalize before use to prevent CPU/GPU device mismatch (Copilot #2991808540)
- setup("fit"): resolve 'auto' backend via _resolve_augmentation_backend() before dataset build so gpu_postprocess matches actual runtime behavior — fixes silent CPU-normalize stripping on machines without CUDA/kornia (Copilot #2991808618, #2991808669)
- 4 new tests covering _resolve_augmentation_backend and namespace pre-resolution
---
Co-authored-by: Claude Code <noreply@anthropic.com>
The try block unconditionally set has_kornia=True with no import, making the except unreachable; on machines with CUDA but without kornia, auto would incorrectly resolve to gpu — causing ImportError or unnormalized training inputs (review HIGH finding).

Also update test_auto_backend_emits_warning and test_gpu_postprocess_true_for_auto_backend to mock CUDA+kornia availability so the GPU path is actually exercised; add complementary no-CUDA tests.

---
Co-authored-by: Claude Code <noreply@anthropic.com>
```python
boxes_padded, valid = collate_boxes(targets, img.device)
img_aug, boxes_aug = self._kornia_pipeline(img, boxes_padded)
img_aug = self._kornia_normalize(img_aug)
targets = unpack_boxes(boxes_aug, valid, targets, *img_aug.shape[-2:])
```
GPU postprocess path currently normalizes images with Kornia but never converts targets[i]["boxes"] from absolute xyxy pixels into normalized cxcywh. In the CPU pipeline this conversion happens inside rfdetr.datasets.transforms.Normalize, and the matcher/criterion expects normalized cxcywh targets. After unpack_boxes(...), convert boxes to cxcywh and divide by [W, H, W, H] (and keep this in sync with any box filtering) so training receives the same target format as the CPU backend.
Suggested change:

```python
targets = unpack_boxes(boxes_aug, valid, targets, *img_aug.shape[-2:])
height, width = img_aug.shape[-2:]
for target in targets:
    boxes = target["boxes"]
    if boxes.numel() == 0:
        continue
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    boxes_cxcywh = torch.stack((cx, cy, w, h), dim=-1)
    scale = torch.tensor([width, height, width, height], dtype=boxes.dtype, device=boxes.device)
    target["boxes"] = boxes_cxcywh / scale
```
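The same arithmetic can be expressed as one vectorized helper, which is easier to unit-test in isolation. A sketch (the helper name is ours, not from the PR):

```python
import torch

def xyxy_abs_to_cxcywh_norm(boxes: torch.Tensor, width: int, height: int) -> torch.Tensor:
    """Convert absolute xyxy boxes of shape (N, 4) to normalized cxcywh."""
    cxcy = (boxes[:, :2] + boxes[:, 2:]) / 2          # box centres in pixels
    wh = boxes[:, 2:] - boxes[:, :2]                  # widths / heights in pixels
    scale = torch.tensor([width, height, width, height],
                         dtype=boxes.dtype, device=boxes.device)
    return torch.cat((cxcy, wh), dim=-1) / scale
```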
```python
if backend != "auto":
    return backend
if not torch.cuda.is_available():
    return "cpu"
try:
    import kornia.augmentation  # noqa: F401

    return "gpu"
except ImportError:
    return "cpu"
```
_resolve_augmentation_backend() calls torch.cuda.is_available(), which this repo explicitly tries to avoid in fork-based DDP/notebook strategies because it can initialize a CUDA driver context. Consider using the existing fork-safe device detection (rfdetr.config.DEVICE / torch.accelerator.current_accelerator logic) here (and in _setup_kornia_pipeline) to prevent regressions in ddp_notebook / fork workflows.
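As a hedged sketch of what a fork-safe check could look like (the repo's actual `rfdetr.config.DEVICE` logic may differ, and `torch.accelerator` is only available on recent torch):

```python
import torch

def cuda_present_fork_safe() -> bool:
    """Detect a CUDA device without torch.cuda.is_available(), which can
    initialize a CUDA driver context and break fork-based DDP/notebook runs."""
    try:
        # torch >= 2.6: accelerator API; does not require touching the device
        acc = torch.accelerator.current_accelerator()
        return acc is not None and acc.type == "cuda"
    except AttributeError:
        # Older torch: device_count() uses a driver-context-free query on
        # recent builds, unlike is_available() on some configurations
        return torch.cuda.device_count() > 0
```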
```python
expanded_scales = getattr(args, "expanded_scales", False)
do_random_resize_via_padding = getattr(args, "do_random_resize_via_padding", False)
patch_size = getattr(args, "patch_size", 16)
num_windows = getattr(args, "num_windows", 4)
aug_config = getattr(args, "aug_config", None)
gpu_postprocess = getattr(args, "augmentation_backend", "cpu") != "cpu" and not include_masks

if square_resize_div_64:
    logger.info(f"Building Roboflow {image_set} dataset with square resize at resolution {resolution}")
    dataset = CocoDetection(
        img_folder,
        ann_file,
        transforms=make_coco_transforms_square_div_64(
            image_set,
            resolution,
            multi_scale=multi_scale,
            expanded_scales=expanded_scales,
            skip_random_resize=not do_random_resize_via_padding,
            patch_size=patch_size,
            num_windows=num_windows,
            aug_config=aug_config,
            gpu_postprocess=gpu_postprocess,
        ),
```
gpu_postprocess is derived from augmentation_backend != "cpu", which treats augmentation_backend="auto" as GPU even on machines without CUDA/kornia. If build_roboflow_from_coco() is invoked without the DataModule’s prior resolution step, this will incorrectly strip CPU Normalize/augmentation and yield unnormalized training inputs. Resolve "auto" to "cpu"|"gpu" (same logic as _resolve_augmentation_backend) before computing gpu_postprocess.
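A minimal sketch of resolving "auto" before deriving `gpu_postprocess`, in the spirit of the PR's `_resolve_augmentation_backend` (helper names here are illustrative, and CUDA availability is passed in to keep the sketch fork-safe and testable):

```python
import importlib.util

def resolve_backend(backend: str, cuda_available: bool) -> str:
    """Collapse 'auto' to a concrete 'cpu' or 'gpu' backend."""
    if backend != "auto":
        return backend
    if not cuda_available:
        return "cpu"
    # Probe for kornia without importing it (no import side effects)
    return "gpu" if importlib.util.find_spec("kornia") is not None else "cpu"

def compute_gpu_postprocess(backend: str, cuda_available: bool, include_masks: bool) -> bool:
    # Only a concretely resolved "gpu" backend may strip the CPU Normalize step
    return resolve_backend(backend, cuda_available) == "gpu" and not include_masks
```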
```python
expanded_scales = getattr(args, "expanded_scales", None)
do_random_resize_via_padding = getattr(args, "do_random_resize_via_padding", False)
patch_size = getattr(args, "patch_size", None)
num_windows = getattr(args, "num_windows", None)
aug_config = getattr(args, "aug_config", None)
gpu_postprocess = getattr(args, "augmentation_backend", "cpu") != "cpu" and not include_masks

if square_resize_div_64:
    dataset = YoloDetection(
        img_folder=str(img_folder),
        lb_folder=str(lb_folder),
        data_file=str(data_file),
        transforms=make_coco_transforms_square_div_64(
            image_set,
            resolution,
            multi_scale=multi_scale,
            expanded_scales=expanded_scales,
            skip_random_resize=not do_random_resize_via_padding,
            patch_size=patch_size,
            num_windows=num_windows,
            aug_config=aug_config,
            gpu_postprocess=gpu_postprocess,
        ),
        include_masks=include_masks,
    )
else:
    dataset = YoloDetection(
        img_folder=str(img_folder),
        lb_folder=str(lb_folder),
        data_file=str(data_file),
        transforms=make_coco_transforms(
            image_set,
            resolution,
            multi_scale=multi_scale,
            expanded_scales=expanded_scales,
            skip_random_resize=not do_random_resize_via_padding,
            patch_size=patch_size,
            num_windows=num_windows,
            aug_config=aug_config,
            gpu_postprocess=gpu_postprocess,
        ),
```
Same issue as COCO Roboflow builder: gpu_postprocess = augmentation_backend != "cpu" treats augmentation_backend="auto" as GPU even when CUDA/kornia are unavailable, which can strip CPU Normalize/augmentation unexpectedly if this builder is called outside the DataModule. Resolve "auto" to an actual backend before deciding gpu_postprocess.
```python
square_resize_div_64 = getattr(args, "square_resize_div_64", False)
augmentation_backend = getattr(args, "augmentation_backend", "cpu")
resolved_backend = augmentation_backend

if augmentation_backend == "auto":
    # Resolve 'auto' based on CUDA and kornia availability
    has_cuda = False
    has_kornia = False
    try:
        import torch

        has_cuda = bool(torch.cuda.is_available())
    except Exception:
        has_cuda = False

    try:
        import kornia.augmentation  # noqa: F401

        has_kornia = True
    except Exception:
        has_kornia = False

    if has_cuda and has_kornia:
        resolved_backend = "gpu"
    else:
        resolved_backend = "cpu"

if resolved_backend != "cpu":
    logger.warning(
        "O365 dataset does not support custom aug_config in Phase 1 GPU augmentation; "
        "Albumentations augmentation is skipped and normalization runs on GPU. "
        "Pass augmentation_backend='cpu' for full CPU augmentation pipeline with O365."
    )
gpu_postprocess = resolved_backend != "cpu"
```
build_o365_raw() resolves only the "auto" backend; augmentation_backend="gpu" is not validated (no CUDA/kornia checks) and will still set gpu_postprocess=True, potentially producing unnormalized inputs if this dataset builder is used outside the Lightning DataModule. Align O365 behavior with the documented backend contract: for "gpu" raise when CUDA is unavailable and raise an ImportError with an install hint when kornia is missing. Also consider narrowing the broad except Exception blocks here to ImportError/RuntimeError so real errors aren’t silently swallowed.
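A hedged sketch of the stricter contract this comment asks for: an explicit "gpu" request fails fast when its prerequisites are missing instead of silently producing unnormalized inputs. Function name and install hint are illustrative:

```python
import importlib.util

def validate_gpu_backend(cuda_available: bool) -> None:
    """Raise early when augmentation_backend='gpu' cannot actually run."""
    if not cuda_available:
        raise RuntimeError(
            "augmentation_backend='gpu' requires CUDA, but no CUDA device is available"
        )
    if importlib.util.find_spec("kornia") is None:
        raise ImportError(
            "augmentation_backend='gpu' requires kornia; "
            "install the optional dependency group (e.g. pip install 'rfdetr[kornia]')"
        )
```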
```python
def test_training_true_applies_augmentation(self, tmp_path):
    """When training=True and _kornia_pipeline is set, augmentation is applied."""
    dm = self._build_dm(tmp_path)
    dm = self._attach_mock_trainer(dm, training=True)

    samples, targets = self._make_kornia_batch()
    img_aug = samples.tensors.clone()
    # Mock pipeline returns (augmented_images, augmented_boxes)
    boxes_padded = torch.tensor([[[2.0, 2.0, 10.0, 10.0]]] * 2)
    mock_pipeline = MagicMock(return_value=(img_aug, boxes_padded))
    dm._kornia_pipeline = mock_pipeline

    # Mock normalize to be a passthrough
    dm._kornia_normalize = MagicMock(side_effect=lambda x: x)

    result = dm.on_after_batch_transfer((samples, targets), dataloader_idx=0)

    mock_pipeline.assert_called_once()
    dm._kornia_normalize.assert_called_once()
    assert isinstance(result, tuple)
    assert len(result) == 2
```
The new on_after_batch_transfer tests only assert that the pipeline/normalize mocks are called, but they don’t validate the most important contract: output images are normalized and target boxes are in the same format as the CPU pipeline (normalized cxcywh). Add assertions on result_targets[...]["boxes"] (shape/range/format) to catch regressions in the GPU postprocess path.
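One way to phrase that contract as a reusable check the tests could assert against. A sketch with names of our choosing, not from the PR:

```python
import torch

def boxes_are_normalized_cxcywh(targets) -> bool:
    """True iff every target's boxes tensor looks like normalized cxcywh:
    shape (N, 4) with all coordinates in [0, 1]."""
    for t in targets:
        boxes = t["boxes"]
        if boxes.ndim != 2 or boxes.shape[-1] != 4:
            return False
        if boxes.numel() and not bool(torch.all((boxes >= 0) & (boxes <= 1))):
            return False
    return True
```

A test would then do `assert boxes_are_normalized_cxcywh(result_targets)` after the hook, which catches both a missing normalize and a missing xyxy-to-cxcywh conversion.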
What does this PR do?
- Add `augmentation_backend` field to `TrainConfig` (cpu/auto/gpu); cpu is the default
- New `src/rfdetr/datasets/kornia_transforms.py`: registry of 8 transform factories, `build_kornia_pipeline`, `build_normalize`, `collate_boxes`/`unpack_boxes` box utilities
- Wire `gpu_postprocess` flag through `coco.py` and `yolo.py` so CPU Albumentations augmentation and normalize are skipped when the GPU path is active
- Add `_setup_kornia_pipeline` + `on_after_batch_transfer` to `RFDETRDataModule`; segmentation models skip GPU aug (Phase 2) with a one-time warning
- Add `kornia>=0.7,<1` optional dep group in `pyproject.toml`
- 12 new tests across `test_module_data.py` and `test_kornia_transforms.py`

Closes #862
Type of Change
Testing
Additional Context